Consensus Graph Representation Learning for Better Grounded Image Captioning
Contemporary visual captioning models frequently hallucinate objects that
are not actually in the scene, due to visual misclassification or
over-reliance on priors, resulting in semantic inconsistency between
the visual information and the target lexical words. The most common remedy is to
encourage the captioning model to dynamically link generated object words or
phrases to appropriate regions of the image, i.e., the grounded image
captioning (GIC). However, GIC relies on an auxiliary task (grounding objects)
that does not solve the key issue behind object hallucination, i.e., the semantic
inconsistency. In this paper, we take a novel perspective on this issue:
exploiting the semantic coherency between the visual and language modalities.
Specifically, we propose the Consensus Graph Representation Learning framework
(CGRL) for GIC that incorporates a consensus representation into the grounded
captioning pipeline. The consensus is learned by aligning the visual graph
(e.g., scene graph) to the language graph, considering both the nodes and
edges of each graph. With the aligned consensus, the captioning model can capture
both the correct linguistic characteristics and visual relevance, and then
ground appropriate image regions. We validate the effectiveness of
our model, with a significant decline in object hallucination (-9% CHAIRi) on
the Flickr30k Entities dataset. Besides, our CGRL is also evaluated by several
automatic metrics and human evaluation; the results indicate that the proposed
approach can simultaneously improve the performance of image captioning (+2.9
CIDEr) and grounding (+2.3 F1LOC).
Comment: 9 pages, 5 figures, AAAI 202
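For readers unfamiliar with the hallucination metric cited above, here is a minimal sketch of how an instance-level CHAIR score (CHAIRi) can be computed. It assumes precomputed ground-truth object sets per image; the real metric additionally maps synonyms to canonical object categories, and all names here are illustrative rather than the paper's code.

```python
def chair_i(captions, gt_objects, object_vocab):
    """CHAIRi: fraction of mentioned object words not present in the image.

    captions:     dict image_id -> generated caption string
    gt_objects:   dict image_id -> set of ground-truth object words
    object_vocab: set of object words to scan captions for
    """
    mentioned, hallucinated = 0, 0
    for img_id, caption in captions.items():
        for word in caption.lower().split():
            if word in object_vocab:
                mentioned += 1
                if word not in gt_objects[img_id]:
                    hallucinated += 1
    return hallucinated / max(mentioned, 1)
```

Lower is better, so the -9% CHAIRi reported above corresponds to fewer mentioned objects missing from the ground-truth set.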
A Category Classification Based Safety Risk Assessment Method for Railway Wagon Loading Status
The identification and control of safety risks in the loading state of goods wagons is one of the important tasks in ensuring the safety of goods in transit. Because the current risk assessment of transportation schemes relies mainly on manual experience and cannot be quantified, it is difficult to accurately determine the safety risk of transportation en route; to address this, a risk assessment method for the loading status of goods wagons based on scenario classification was proposed. Firstly, based on a detailed analysis of the safety risk points in various stages of railway freight operations, a SHEL influencing-factor model based on scenario classification was constructed. Then, considering the characteristics of railway freight transportation, a fuzzy fault tree analysis (FTA) model of goods wagon loading state risk was constructed, and the fault tree was transformed into a Bayesian network structure according to the mapping algorithm between fuzzy fault trees and Bayesian networks. Furthermore, a triangular fuzzy membership function was introduced to describe the fault probability of nodes, and a BN-based fuzzy fault tree inference algorithm was proposed. Finally, taking a railway station and a route transporting coil steel in China as an example, this paper explains how to integrate expert knowledge through the fault tree and Bayesian network to support railway freight scheme designers in conducting quantitative risk assessment of freight wagon loading status.
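The abstract does not spell out the fuzzification step, but a triangular fuzzy membership function has a standard form; the sketch below shows that form plus centroid defuzzification as one common way to obtain a crisp node fault probability. The paper's exact choice is not stated, so this is only an assumed recipe.

```python
def triangular_membership(x, a, m, b):
    """Membership degree of x under a triangular fuzzy number (a, m, b),
    where a <= m <= b bound an expert's fault-probability estimate."""
    if a < x <= m:
        return (x - a) / (m - a)
    if m < x < b:
        return (b - x) / (b - m)
    return 1.0 if x == m else 0.0

def defuzzify(a, m, b):
    """Centroid defuzzification: a crisp fault probability from the fuzzy number."""
    return (a + m + b) / 3.0

# Example: an expert judges a node's failure probability as "around 0.02".
print(defuzzify(0.01, 0.02, 0.04))  # -> approximately 0.0233
```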
Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels
Conventional multi-label classification (MLC) methods assume that all samples
are fully labeled and identically distributed. Unfortunately, this assumption
is unrealistic for large-scale MLC data that has a long-tailed (LT) distribution
and partial labels (PL). To address the problem, we introduce a novel task,
Partial labeling and Long-Tailed Multi-Label Classification (PLT-MLC), to
jointly consider the above two imperfect learning environments. Not
surprisingly, we find that most LT-MLC and PL-MLC approaches fail to solve the
PLT-MLC, resulting in significant performance degradation on the two proposed
PLT-MLC benchmarks. Therefore, we propose an end-to-end learning framework:
COrrection ModificatIon balanCe, abbreviated as COMIC. Our bootstrapping
philosophy is to simultaneously correct the missing labels (Correction) with
confident predictions above a class-aware threshold and to learn from
these recalled labels during training. We next propose a novel multi-focal
modifier loss that simultaneously addresses head-tail imbalance and
positive-negative imbalance to adaptively modify the attention to different
samples (Modification) under the LT class distribution. In addition, we develop
a balanced training strategy by distilling the model's learning effect from
head and tail samples, and thus design a balanced classifier (Balance)
conditioned on the head and tail learning effect to maintain stable performance
for all samples. Our experimental study shows that the proposed COMIC
significantly outperforms general MLC, LT-MLC, and PL-MLC methods in terms of
effectiveness and robustness on our newly created PLT-MLC datasets.
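The abstract gives no formula for the multi-focal modifier loss, so the following is only a hedged sketch of the general idea it describes: asymmetric focal terms for positive-negative imbalance plus inverse-frequency class weights for head-tail imbalance. The function name and hyperparameters are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def multi_focal_loss(logits, targets, class_freq, gamma_pos=1.0, gamma_neg=4.0):
    """Sketch of a focal-style multi-label loss handling two imbalances.

    logits, targets: (batch, num_classes); targets are {0, 1} floats
    class_freq:      (num_classes,) label counts in the training set
    """
    p = torch.sigmoid(logits)
    # Head-tail imbalance: rarer (tail) classes receive larger weights.
    w = 1.0 / class_freq.clamp(min=1).float()
    w = w / w.sum() * len(class_freq)  # normalize so weights average to 1
    # Positive-negative imbalance: focus harder on (abundant) negatives.
    pos = targets * (1 - p).pow(gamma_pos) * F.logsigmoid(logits)
    neg = (1 - targets) * p.pow(gamma_neg) * F.logsigmoid(-logits)
    return -(w * (pos + neg)).mean()
```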
IDEAL: Toward High-efficiency Device-Cloud Collaborative and Dynamic Recommendation System
Recommendation systems have shown great potential to solve the information
explosion problem and enhance user experience in various online applications,
which recently present two emerging trends: (i) Collaboration: from a
single-sided model trained on the cloud (separate learning) to device-cloud
collaborative recommendation (collaborative learning). (ii) Real-time Dynamic:
from network parameters that are the same across all instances (static model)
to adaptive network parameter generation conditioned on real-time instances (dynamic
model). The aforementioned two trends enable the device-cloud collaborative and
dynamic recommendation, which deeply exploits the recommendation pattern among
cloud-device data and efficiently characterizes different instances with
different underlying distributions, at the cost of frequent device-cloud
communication. Despite being promising, we argue that most of these
communications, which request new parameters of the recommendation system from
the cloud, are unnecessary, since the on-device data distribution is not always
changing. To alleviate this issue, we design an Intelligent DEvice-Cloud PArameter Request
ModeL (IDEAL) that can be deployed on the device to calculate the request
revenue with low resource consumption, so as to ensure the adaptive
device-cloud communication with high revenue. We envision a new device
intelligence learning task to implement IDEAL by detecting out-of-domain data.
Moreover, we map the user's real-time behavior to a normal distribution, and
the uncertainty is calculated from the multi-sampling outputs to measure the
generalization ability of the device model to the current user behavior. Our
experimental study demonstrates IDEAL's effectiveness and generalizability on
four public benchmarks, yielding a more efficient device-cloud collaborative
and dynamic recommendation paradigm.
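As a concrete reading of the multi-sampling uncertainty described above, here is a minimal sketch: the behavior encoding is treated as a Gaussian, several reparameterized samples are scored, and the prediction variance gates the parameter request. `encoder`, `head`, and the threshold are hypothetical stand-ins, not IDEAL's actual modules.

```python
import torch

def request_decision(encoder, head, behavior, n_samples=10, threshold=0.1):
    """Decide on-device whether to request new parameters from the cloud.

    encoder(behavior) -> (mu, log_var), a Gaussian over behavior embeddings
    head(z)           -> model prediction from a sampled embedding
    """
    mu, log_var = encoder(behavior)
    std = (0.5 * log_var).exp()
    preds = []
    for _ in range(n_samples):
        z = mu + std * torch.randn_like(std)  # reparameterized sample
        preds.append(head(z))
    preds = torch.stack(preds)                # (n_samples, batch, ...)
    uncertainty = preds.var(dim=0).mean()     # high variance = likely out-of-domain
    return uncertainty.item() > threshold     # True -> request cloud parameters
```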
Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding
Temporal grounding is the task of locating a specific segment from an
untrimmed video according to a query sentence. This task has achieved
significant momentum in the computer vision community as it enables activity
grounding beyond pre-defined activity classes by utilizing the semantic
diversity of natural language descriptions. The semantic diversity is rooted in
the principle of compositionality in linguistics, where novel semantics can be
systematically described by combining known words in novel ways (compositional
generalization). However, existing temporal grounding datasets are not
carefully designed to evaluate the compositional generalizability. To
systematically benchmark the compositional generalizability of temporal
grounding models, we introduce a new Compositional Temporal Grounding task and
construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. When
evaluating the state-of-the-art methods on our new dataset splits, we
empirically find that they fail to generalize to queries with novel
combinations of seen words. We argue that the inherent structured semantics
inside videos and language is the crucial factor in achieving compositional
generalization. Based on this insight, we propose a variational cross-graph
reasoning framework that explicitly decomposes video and language into
hierarchical semantic graphs, respectively, and learns fine-grained semantic
correspondence between the two graphs. Furthermore, we introduce a novel
adaptive structured semantics learning approach to derive the
structure-informed and domain-generalizable graph representations, which
facilitate the fine-grained semantic correspondence reasoning between the two
graphs. Extensive experiments validate the superior compositional
generalizability of our approach.
Comment: arXiv admin note: substantial text overlap with arXiv:2203.1304
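The fine-grained correspondence learning described above can be pictured with a simple soft-alignment step. The sketch below assumes node features from the two graphs are already projected into a shared space, and it is a generic attention-style alignment, not the paper's variational formulation.

```python
import torch

def cross_graph_alignment(video_nodes, text_nodes):
    """Soft alignment between language-graph and video-graph nodes.

    video_nodes: (n_video, d); text_nodes: (n_text, d), shared space assumed.
    """
    sim = text_nodes @ video_nodes.t()     # (n_text, n_video) similarities
    attn = sim.softmax(dim=-1)             # each text node attends over video nodes
    aligned = attn @ video_nodes           # video evidence gathered per text node
    score = torch.cosine_similarity(aligned, text_nodes, dim=-1).mean()
    return aligned, score                  # score: graph-level correspondence
```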
Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer
Active Domain Adaptation (ADA) aims to maximally boost model adaptation in a
new target domain by actively selecting a limited number of target data to
annotate. This setting neglects the more practical scenario where training data
are collected from multiple sources. This motivates us to target a new and
challenging setting of knowledge transfer that extends ADA from a single source
domain to multiple source domains, termed Multi-source Active Domain Adaptation
(MADA). Not surprisingly, we find that most traditional ADA methods cannot work
directly in such a setting, mainly due to the excessive domain gap introduced
by all the source domains, and thus their uncertainty-aware sample selection can
easily become miscalibrated under the multi-domain shifts. Considering this, we
propose a Dynamic integrated uncertainty valuation framework (Detective) that
comprehensively considers the domain shift between the multi-source domains and
the target domain to detect the informative target samples. Specifically,
Detective leverages a dynamic Domain Adaptation (DA) model that learns how to adapt the
model's parameters to fit the union of multi-source domains. This enables an
approximate single-source domain modeling by the dynamic model. We then
comprehensively measure both domain uncertainty and predictive uncertainty in
the target domain to detect informative target samples using evidential deep
learning, thereby mitigating uncertainty miscalibration. Furthermore, we
introduce a contextual diversity-aware calculator to enhance the diversity of
the selected samples. Experiments demonstrate that our solution outperforms
existing methods by a considerable margin on three domain adaptation
benchmarks.
Comment: arXiv admin note: text overlap with arXiv:2302.13824 by other author
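Evidential deep learning, used above for uncertainty estimation, has a standard Dirichlet form; the sketch below follows that standard recipe (non-negative evidence via softplus, uncertainty mass K/S) and is not Detective's specific implementation.

```python
import torch
import torch.nn.functional as F

def evidential_uncertainty(logits):
    """Dirichlet-based predictive uncertainty from per-class logits.

    Standard evidential deep learning: evidence e >= 0, alpha = e + 1,
    uncertainty u = K / sum(alpha), where K is the number of classes.
    """
    evidence = F.softplus(logits)            # non-negative evidence per class
    alpha = evidence + 1.0                   # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1)             # total evidence S
    num_classes = logits.size(-1)
    uncertainty = num_classes / strength     # in (0, 1]; high means uncertain
    prob = alpha / strength.unsqueeze(-1)    # expected class probabilities
    return prob, uncertainty
```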
Dilated Context Integrated Network with Cross-Modal Consensus for Temporal Emotion Localization in Videos
Understanding human emotions is a crucial ability for intelligent robots to
provide better human-robot interactions. The existing works are limited to
trimmed video-level emotion classification, failing to locate the temporal
window corresponding to the emotion. In this paper, we introduce a new task,
named Temporal Emotion Localization in videos~(TEL), which aims to detect human
emotions and localize their corresponding temporal boundaries in untrimmed
videos with aligned subtitles. TEL presents three unique challenges compared to
temporal action localization: 1) The emotions have extremely varied temporal
dynamics; 2) The emotion cues are embedded in both appearances and complex
plots; 3) The fine-grained temporal annotations are complicated and
labor-intensive. To address the first two challenges, we propose a novel
dilated context integrated network with a coarse-fine two-stream architecture.
The coarse stream captures varied temporal dynamics by modeling
multi-granularity temporal contexts. The fine stream achieves complex plot
understanding by reasoning about the dependencies between the multi-granularity
temporal contexts from the coarse stream and adaptively integrating them into
fine-grained video segment features. To address the third challenge, we
introduce a cross-modal consensus learning paradigm, which leverages the
inherent semantic consensus between the aligned video and subtitle to achieve
weakly-supervised learning. We contribute a new testing set with 3,000
manually-annotated temporal boundaries so that future research on the TEL
problem can be quantitatively evaluated. Extensive experiments show the
effectiveness of our approach on temporal emotion localization. The repository
of this work is at
https://github.com/YYJMJC/Temporal-Emotion-Localization-in-Videos.
Comment: Accepted by ACM Multimedia 202
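One plausible reading of the dilated context integration in the coarse stream is a stack of parallel dilated 1D convolutions over the frame sequence, each dilation rate covering one temporal granularity. The module below is a generic sketch under that assumption, not the paper's exact architecture.

```python
import torch.nn as nn

class DilatedContext(nn.Module):
    """Multi-granularity temporal context via parallel dilated 1D convolutions."""

    def __init__(self, dim, dilations=(1, 2, 4, 8)):
        super().__init__()
        # padding = dilation keeps the temporal length unchanged for kernel 3.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations)

    def forward(self, x):  # x: (batch, dim, time)
        ctx = [conv(x) for conv in self.convs]
        return x + sum(ctx) / len(ctx)  # residual fusion of all granularities
```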
De-fine: Decomposing and Refining Visual Programs with Auto-Feedback
Visual programming, a modular and generalizable paradigm, integrates
different modules and Python operators to solve various vision-language tasks.
Unlike end-to-end models that need task-specific data, it excels at
performing visual processing and reasoning in an unsupervised manner. Current
visual programming methods generate programs in a single pass for each task,
lacking the ability to evaluate and optimize based on feedback, which
consequently limits their effectiveness for complex, multi-step problems.
Drawing inspiration from Benders decomposition, we
introduce De-fine, a general framework that automatically decomposes complex
tasks into simpler subtasks and refines programs through auto-feedback. This
model-agnostic approach can improve logical reasoning performance by
integrating the strengths of multiple models. Our experiments across various
visual tasks show that De-fine creates more accurate and robust programs,
setting new benchmarks in the field.
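The decompose-then-refine idea can be summarized as a small control loop. The sketch below is only a schematic, with `llm` and `executor` as hypothetical callables (a prompt-following model and a program runner that returns feedback); they are not De-fine's actual interfaces.

```python
def decompose_and_refine(llm, executor, task, max_rounds=3):
    """Schematic De-fine-style loop: decompose a task into subtasks, generate a
    visual program, then iteratively refine it from execution feedback."""
    subtasks = llm(f"Decompose this vision-language task into subtasks: {task}")
    program = llm(f"Write a visual program solving these subtasks: {subtasks}")
    for _ in range(max_rounds):
        result, feedback = executor(program)  # feedback is empty on success
        if not feedback:
            break
        program = llm(f"Refine the program.\nFeedback: {feedback}\n{program}")
    return program
```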
Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models
Prompt tuning, a recently emerging paradigm, enables the powerful
vision-language pre-training models to adapt to downstream tasks in a
parameter- and data-efficient way, by learning ``soft prompts'' to condition
frozen pre-training models. Though effective, it is particularly problematic in
the few-shot scenario, where prompt tuning performance is sensitive to the
initialization and requires a time-consuming process to find a good
initialization, thus restricting the fast adaptation ability of the
pre-training models. In addition, prompt tuning could undermine the
generalizability of the pre-training models, because the learnable prompt
tokens easily overfit to the limited training samples. To address these
issues, we introduce a novel Gradient-RegulAted Meta-prompt learning (GRAM)
framework that jointly meta-learns an efficient soft prompt initialization for
better adaptation and a lightweight gradient regulating function for strong
cross-domain generalizability in a meta-learning paradigm using only the
unlabeled image-text pre-training data. Rather than designing a specific prompt
tuning method, our GRAM can be easily incorporated into various prompt tuning
methods in a model-agnostic way, and comprehensive experiments show that GRAM
brings about consistent improvement for them in several settings (i.e.,
few-shot learning, cross-domain generalization, cross-dataset generalization,
etc.) over 11 datasets. Further, experiments show that GRAM enables the
orthogonal methods of textual and visual prompt tuning to work in a
mutually-enhanced way, offering better generalizability beyond the uni-modal
prompt tuning methods.
Comment: Accepted by ICCV 202
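Since the abstract only names the meta-learning ingredients, here is a hedged, first-order MAML-style sketch of meta-learning a soft-prompt initialization. `loss_fn` (which runs the frozen backbone with a given prompt) and the learning rates are assumptions, and GRAM's gradient regulating function is omitted.

```python
import torch

def meta_prompt_step(prompt, tasks, loss_fn, inner_lr=1e-2, meta_lr=1e-3):
    """One meta-update of a shared soft-prompt initialization (first-order MAML).

    prompt:  leaf tensor with requires_grad=True, the shared initialization
    tasks:   list of (support_batch, query_batch) pairs
    loss_fn: loss_fn(prompt, batch) -> scalar loss of the frozen backbone
    """
    meta_grad = torch.zeros_like(prompt)
    for support, query in tasks:
        # Inner loop: adapt the prompt with one gradient step on the support set.
        g = torch.autograd.grad(loss_fn(prompt, support), prompt)[0]
        adapted = prompt - inner_lr * g
        # Outer loop: query-set gradient at the adapted prompt (first-order).
        meta_grad += torch.autograd.grad(loss_fn(adapted, query), adapted)[0]
    with torch.no_grad():
        prompt -= meta_lr * meta_grad / max(len(tasks), 1)
    return prompt
```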